(Almost) Automatic Semantic Feature Extraction from Technical Text
Abstract
Acquisition of semantic information is necessary for proper understanding of natural language text. Such information is often domain-specific in nature and must be acquired from the domain. This causes a problem whenever a natural language processing (NLP) system is moved from one domain to another. The portability of an NLP system can be improved if these semantic features can be acquired with limited human intervention. This paper proposes an approach towards (almost) automatic semantic feature extraction.

1. INTRODUCTION

Acquisition of semantic information is necessary for proper understanding of natural language text. Such information is often domain-specific in nature and must be acquired from the domain. When an NLP system is moved from one domain to another, usually a substantial amount of time has to be spent in tailoring the system to the new domain. Most of this time is spent on acquiring the semantic features specific to that domain. It is important to automate the acquisition of semantic information as much as possible, and to facilitate whatever human intervention is absolutely necessary. Portability of NLP systems has been of concern to researchers for some time [8, 5, 11, 9]. This paper proposes an approach to obtain the domain-dependent semantic features of any given domain in a domain-independent manner.

The next section describes an existing NLP system (KUDZU) which has been developed at Mississippi State University. Section 3 then presents the motivation behind the automatic acquisition of the semantic features of a domain, and a brief outline of the methodology proposed to do it. Section 4 describes the different steps in this methodology in detail. Section 5 focuses on the applications of the semantic features. Section 6 compares the proposed approach to similar research efforts. The last section presents some final comments.

2. THE EXISTING KUDZU SYSTEM

The research described in this paper is part of a larger ongoing project called the KUDZU (Knowledge Under Development from Zero Understanding) project. This project is aimed at exploring the automation of extraction of information from technical texts. The KUDZU system has two primary components: an NLP component and a Knowledge Analysis (KA) component. This section describes this system in order to facilitate understanding of the approach described in this paper.

The NLP component consists of a tagger, a semi-parser, a prepositional phrase attachment specialist, a conjunct identifier for coordinate conjunctions, and a restructurer. The tagger is an n-gram based program that currently generates syntactic/semantic tags for the words in the corpus. The syntactic portion of the tag is mandatory, and the semantic portion depends on whether the word has any special domain-specific classification. Currently only nouns, gerunds, and adjectives are assigned semantic tags. For example, in the domain of veterinary medicine, "dog" would be assigned the tag "noun--patient", "nasal" would be "adj--body-part", etc.

The parsing strategy is based on the initial identification of simple phrases, which are later converted to deeper structures with the help of separate specialist programs for coordinate conjunct identification and prepositional phrase attachment. The result for a given sentence is a single parse, many of whose elements are comparatively underspecified. For example, the parses generated lack clause boundaries.
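To make the tag format and the flat, phrase-level parse concrete, the following sketch shows one plausible representation. The tag values for "dog" and "nasal" follow the paper's own examples; the sentence, the third semantic class, the data structures, and all names are hypothetical assumptions rather than the KUDZU system's actual implementation.

```python
# Illustrative only: one plausible way to represent KUDZU-style
# syntactic/semantic tags and a flat, phrase-level ("semi-parsed") sentence.
# The tag values for "dog" and "nasal" follow the paper's examples; the
# sentence, the "sign" semantic class, and all names are hypothetical.

from dataclasses import dataclass
from typing import Optional


@dataclass
class TaggedWord:
    word: str
    syntactic: str                  # mandatory syntactic portion of the tag
    semantic: Optional[str] = None  # optional domain-specific semantic portion

    def tag(self) -> str:
        return f"{self.syntactic}--{self.semantic}" if self.semantic else self.syntactic


# A made-up veterinary-medicine sentence: "The dog exhibited nasal discharge".
words = [
    TaggedWord("The", "det"),
    TaggedWord("dog", "noun", "patient"),      # paper: "dog" -> noun--patient
    TaggedWord("exhibited", "verb"),
    TaggedWord("nasal", "adj", "body-part"),   # paper: "nasal" -> adj--body-part
    TaggedWord("discharge", "noun", "sign"),   # hypothetical semantic class
]

# A flat semi-parse: just a sequence of simple phrases, no clause boundaries.
semi_parse = [
    ("noun-phrase", words[0:2]),
    ("verb-phrase", words[2:3]),
    ("noun-phrase", words[3:5]),
]

for label, phrase in semi_parse:
    print(label, [w.tag() for w in phrase])
```

The flat phrase list mirrors the underspecification noted above: there is no clause structure, only a sequence of tagged simple phrases.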
Nevertheless, the results are surprisingly useful in the extraction of relationships from the corpus. The semi-parser recognizes noun-, verb-, prepositional-, gerund-, infinitive-, and adjectival phrases. The prepositional phrase attachment specialist [2] uses case grammar analysis to disambiguate the attachments of prepositional phrases and is highly domain-dependent. The current implementation of this subcomponent is highly specific to the domain of veterinary medicine, the initial testbed for the KUDZU system. Note that all the examples presented in this paper will be taken from this domain. The coordinate conjunction specialist identifies pairs of appropriate conjuncts for the coordinate conjunctions in the text and is domain-independent in nature [1]. The restructurer puts together the information acquired by the specialist programs in order to provide a better (and deeper) structure to the parse. Before being passed to the knowledge analysis portion of the system, some parses undergo manual modification, which is

*This research is supported by the NSF-ARPA grant number IRI 9314963.
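Read as a pipeline, the NLP component described above chains the tagger, the semi-parser, and the two specialists before restructuring. The sketch below shows that composition at the level of function signatures only; the stage names follow the paper, but the signatures, the Parse placeholder type, and the stub bodies are assumptions for illustration, not the actual KUDZU code.

```python
# Schematic composition of the NLP component's stages as described in
# Section 2. Stage names follow the paper; signatures and the Parse
# placeholder type are assumptions, and the bodies are left as stubs.

from typing import Any, List

Parse = List[Any]  # placeholder for whatever parse representation is used


def tag(sentence: str) -> Parse:
    """n-gram based tagging: assign syntactic (and, where applicable,
    domain-specific semantic) tags to each word."""
    raise NotImplementedError


def semi_parse(tagged: Parse) -> Parse:
    """Group tagged words into simple noun-, verb-, prepositional-,
    gerund-, infinitive-, and adjectival phrases."""
    raise NotImplementedError


def attach_prepositional_phrases(parse: Parse) -> Parse:
    """Domain-dependent specialist: disambiguate prepositional phrase
    attachment using case grammar analysis."""
    raise NotImplementedError


def identify_conjuncts(parse: Parse) -> Parse:
    """Domain-independent specialist: pair up appropriate conjuncts for
    the coordinate conjunctions in the text."""
    raise NotImplementedError


def restructure(parse: Parse) -> Parse:
    """Merge the specialists' results into a deeper parse structure."""
    raise NotImplementedError


def nlp_component(sentence: str) -> Parse:
    """Chain the stages in the order described in Section 2."""
    parse = semi_parse(tag(sentence))
    parse = attach_prepositional_phrases(parse)
    parse = identify_conjuncts(parse)
    return restructure(parse)
```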